Understanding the Limits of Capacity Sharing in CMP Private Caches
Authors
Abstract
Chip Multiprocessor (CMP) systems present interesting design challenges at the lower levels of the cache hierarchy. Private L2 caches allow easier processor-cache design reuse, and thus scale better than a shared L2 cache, while offering better performance isolation and lower access latency. While several private cache management schemes that utilize space in peer private L2 caches have recently been proposed, we find that there is significant potential for improving their performance. We propose and study an oracular scheme, OPT, which identifies the performance limits of private cache management schemes. OPT uses offline-generated traces of cache accesses to uncover each application's reuse patterns, and uses this perfect knowledge of future memory accesses to optimally place cache blocks brought on-chip in either the local or a remote private L2 cache. We discover that in order to optimally manage private caches, peer private caches must be utilized not only at local-cache-replacement time, as has previously been proposed, but also at cache-placement time: it may be better to place a missed block directly into a peer L2 rather than first bringing it into the local L2, as the traditional approach does. We evaluate OPT on a 4-core CMP with 512KB, 8-way private L2 caches, across 10 carefully chosen, relevant multiprogram workload mixes. Compared to a baseline system that does not employ capacity sharing across private caches, OPT improves weighted speedup by 13.4% on average; compared to the state-of-the-art private cache management technique, OPT improves weighted speedup by 11.2%. This shows the significant potential that remains for improving previously proposed private cache management schemes.
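The placement decision the abstract describes can be illustrated with a toy sketch. This is not the paper's simulator: the `place_block` helper, the tiny set-based caches, and the unit capacity are all hypothetical, and the logic simply applies a Belady-style comparison of next-use times, preferring the local cache but spilling to a peer (or bypassing) when the missed block would be reused later than everything already cached.

```python
# Toy oracle-placement sketch in the spirit of OPT (hypothetical helper
# names; caches are modeled as sets of block IDs, not the paper's
# 512KB 8-way configuration).
from typing import List, Set


def next_use(trace: List[str], start: int, block: str) -> float:
    """Index of the next access to `block` at or after `start` (inf if none)."""
    for t in range(start, len(trace)):
        if trace[t] == block:
            return t
    return float("inf")


def place_block(block: str, now: int, trace: List[str],
                local: Set[str], peer: Set[str], capacity: int) -> str:
    """Decide where a missed block goes: local L2, a peer L2, or bypass.

    With perfect knowledge of the future trace, the block displaces the
    cached line with the farthest next use (Belady's choice), trying the
    local cache first and a peer cache second.
    """
    nu = next_use(trace, now + 1, block)
    for name, cache in (("local", local), ("peer", peer)):
        if len(cache) < capacity:
            cache.add(block)
            return name
        victim = max(cache, key=lambda b: next_use(trace, now + 1, b))
        if next_use(trace, now + 1, victim) > nu:
            cache.discard(victim)
            cache.add(block)
            return name
    return "bypass"  # reused too far in the future to be worth caching
```

In this sketch, placing directly into a peer happens exactly when the local cache holds only lines reused sooner than the missed block while the peer holds a line reused later, which is the placement-time opportunity the abstract points out.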
Similar Resources
Victim Migration: Dynamically Adapting Between Private and Shared CMP Caches
Future CMPs will have more cores and greater on-chip cache capacity. The on-chip cache can either be divided into separate private L2 caches for each core, or treated as a large shared L2 cache. Private caches provide low hit latency but low capacity, while shared caches have higher hit latencies but greater capacity. Victim replication was previously introduced as a way of reducing the average ...
Balancing Capacity and Latency in CMP Caches
The large working sets of commercial and scientific workloads stress the L2 caches of Chip Multiprocessors (CMPs). Some CMPs use a shared L2 cache, to maximize the on-chip cache capacity and minimize misses. Others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals strive to balance latency and capacity, but use ...
A Reusability-Aware Cache Memory Sharing Technique for High Performance CMPs with Private L2 Caches
For high-performance chip multiprocessors (CMPs) to achieve their maximum performance potential, efficient support for the memory hierarchy is important. Since off-chip accesses incur a long latency, high-performance CMPs are typically based on multiple levels of on-chip cache memories. For example, most current CMPs support two levels of on-chip caches. While the L1 cache architecture of thes...
Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis
Understanding multicore memory behavior is crucial, but can be challenging due to the cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complex thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform simulation to sort out these interactions, but this can be costly ...
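The reuse distance analysis mentioned above can be sketched minimally. This is a generic illustration of the standard definition (the number of distinct blocks touched between two consecutive accesses to the same block), not the cited paper's tooling; the function name and trace representation are my own.

```python
# Minimal reuse-distance computation over an address trace.
# First (cold) accesses get distance None; repeat accesses get the
# count of distinct blocks touched strictly between the two uses.
def reuse_distances(trace):
    last_seen = {}  # block -> index of its previous access
    dists = []
    for i, block in enumerate(trace):
        if block in last_seen:
            between = set(trace[last_seen[block] + 1 : i])
            dists.append(len(between))
        else:
            dists.append(None)  # cold access
        last_seen[block] = i
    return dists
```

A useful property of this metric: under fully associative LRU, an access hits in a cache of S blocks exactly when its reuse distance is less than S, which is why reuse-distance histograms predict miss rates across cache sizes without re-simulating each one.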